Additional Features
linea offers a few useful features to make modelling quicker, simpler, and more accurate. This page covers a basic implementation of the features below:

- grouped decompositions, via a categories mapping
- generated seasonality variables, via linea::get_seasonality()
- next-variable testing, via linea::what_next()
- Google Trends data, via linea::gt_f()
- pooled models for panel data

We will run simple models on some fictitious data sourced from Google Trends. The aim of this exercise is to demonstrate the use of the features above.

We start by importing linea and some other useful libraries.
library(linea) # modelling
library(tidyverse) # data manipulation
library(plotly) # visualization
library(DT) # visualization
The output of the linea::decomp_chart() function can be grouped based on a data.frame mapping variables to categories and, optionally, to specific operations (e.g. min and max). This helps simplify the visualization and focus it on specific groups of variables. Let's start by looking at a non-aggregated, variable-level decomposition.
First, we import some data…
data_path = 'https://raw.githubusercontent.com/paladinic/data/main/ecomm_data.csv'
data = read_xcsv(file = data_path)
data %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))
…and run a model.
dv = 'ecommerce'
ivs = c('christmas','covid','black.friday','offline_media')
model = data %>%
  run_model(dv = dv,
            ivs = ivs)
summary(model)
##
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -22738.0 -4713.4 -4.6 4550.7 21995.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.642e+04 5.486e+02 102.849 < 2e-16 ***
## christmas 2.913e+02 2.523e+01 11.546 < 2e-16 ***
## covid 3.014e+02 1.606e+01 18.775 < 2e-16 ***
## black.friday 2.796e+02 3.791e+01 7.374 2.29e-12 ***
## offline_media 5.538e+00 6.509e-01 8.507 1.51e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7038 on 256 degrees of freedom
## Multiple R-squared: 0.7752, Adjusted R-squared: 0.7717
## F-statistic: 220.8 on 4 and 256 DF, p-value: < 2.2e-16
Now we can plot our variable decomposition.
model %>%
  decomp_chart(variable_decomp = T)
Now let's create a categories data.frame to group the ‘christmas’ and ‘black.friday’ variables together.
categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','seasonality','media')
)
model = run_model(
  data = data,
  dv = dv,
  ivs = ivs,
  categories = categories,
  id_var = 'date' # specify horizontal axis
)
model %>%
  decomp_chart(variable_decomp = F)
The ‘christmas’ and ‘black.friday’ variables are derived from Google Trends, which captures the impact of these events over time. As there is always some level of search for these keywords throughout the year, the series never reaches zero. Using the calc column of the categories data.frame, we can tell linea to add this minimum level of search to the intercept, isolating the impact of the variable's variation.
categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','seasonality','media'),
  calc = c('min','none','min','none')
)
model = run_model(
  data = data,
  dv = dv,
  ivs = ivs,
  categories = categories,
  id_var = 'date' # specify horizontal axis
)
model %>%
  decomp_chart(variable_decomp = F)
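To make the ‘min’ operation concrete, here is a rough sketch of the arithmetic (an illustration, not linea's internal code), assuming the model object exposes its coefficients like a standard lm object:

# the level of search always present throughout the year
min_christmas = min(data$christmas)
# the constant part of the contribution, coefficient * min(x),
# is what gets moved from 'christmas' into the intercept
base_shift = coef(model)['christmas'] * min_christmas
base_shift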
While the model above captures some of the variation in our ecommerce variable, there is still a lot left unexplained. Using a date column, of data-type date, we can generate seasonality variables with linea::get_seasonality(). Several columns will be added to the original data.frame. These are mainly dummy variables that capture some basic holidays, as well as year, month, and week number. A trend variable is also added: a column running from 1 to n, where n is the number of rows.
data = data %>%
  get_seasonality(
    date_col_name = 'date',
    date_type = 'weekly ending')
data %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))
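To quickly list some of the generated columns without scrolling through the table above, a pattern match on the column names should work (note that holiday dummies, e.g. new_years_eve, have their own names and may not match this pattern):

# list generated seasonality columns by name pattern
grep('trend|week_|month_|year_', names(data), value = TRUE)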
plot_ly(data) %>%
  add_bars(y = ~ week_26,
           x = ~ date,
           name = 'week_26',
           color = color_palette()[1]) %>%
  add_bars(y = ~ new_years_eve,
           x = ~ date,
           name = 'new_years_eve',
           color = color_palette()[2]) %>%
  add_bars(y = ~ year_2019,
           x = ~ date,
           name = 'year_2019',
           color = color_palette()[3]) %>%
  layout(yaxis = list(title = 'value'),
         title = 'Seasonality Variables',
         plot_bgcolor = "rgba(0, 0, 0, 0)",
         paper_bgcolor = "rgba(0, 0, 0, 0)")
These variables can be used in the model to capture the seasonal component of the dependent variable, among other things (e.g. trend).
ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec')
model = run_model(data = data,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')
summary(model)
##
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -20899.1 -3149.9 -871.3 2667.1 20500.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.822e+04 7.763e+02 62.115 < 2e-16 ***
## christmas 2.546e+02 3.201e+01 7.955 5.93e-14 ***
## covid 1.482e+02 1.738e+01 8.525 1.38e-15 ***
## black.friday 2.713e+02 3.215e+01 8.438 2.47e-15 ***
## offline_media 5.609e+00 5.098e-01 11.003 < 2e-16 ***
## trend 8.142e+01 6.384e+00 12.753 < 2e-16 ***
## month_Dec 1.573e+03 2.083e+03 0.755 0.451
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5510 on 254 degrees of freedom
## Multiple R-squared: 0.8633, Adjusted R-squared: 0.8601
## F-statistic: 267.3 on 6 and 254 DF, p-value: < 2.2e-16
Thanks to the new variables, this model has a better R-squared (~86%) than the previous one. The impact of these variables can be seen clearly using the linea::decomp_chart() function.
model %>%
  decomp_chart()
To simplify this visualization it is worth using categories, as demonstrated previously.
categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','seasonality','media','Base','seasonality'),
  calc = c('min','none','min','none','none','none')
)
model = run_model(data = data,
                  categories = categories,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')
model %>% decomp_chart()
While the model is improving thanks to the seasonal variables introduced, selecting which variable to add next can be tricky and tedious.
df = model %>% what_next()
## Warning: model object does not contain 'meta_data'.
df %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))
As shown above, the linea::what_next() function generates a data.frame where each row represents a variable in our data and the impact it would have on our model, in terms of fit and coefficient significance. We can now quickly see which variables are more likely to benefit the model.
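For instance, the candidates can be ranked programmatically. The column name used below (adj_R2) is hypothetical; check the actual names in the what_next() output above and adjust accordingly.

# rank candidate variables by fit ('adj_R2' is a hypothetical column name)
df %>%
  arrange(desc(adj_R2)) %>%
  head()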
ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec','year_2021','week_51')
categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','seasonality','media','Base','seasonality','covid','seasonality'),
  calc = c('min','none','min','none','none','none','none','none')
)
model = run_model(data = data,
                  categories = categories,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')
summary(model)
##
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -14353.5 -2856.5 -891.7 2910.0 20611.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.765e+04 6.658e+02 71.572 < 2e-16 ***
## christmas 3.112e+02 2.947e+01 10.560 < 2e-16 ***
## covid 1.910e+02 1.575e+01 12.132 < 2e-16 ***
## black.friday 2.483e+02 2.777e+01 8.940 < 2e-16 ***
## offline_media 4.756e+00 4.441e-01 10.709 < 2e-16 ***
## trend 8.429e+01 5.472e+00 15.404 < 2e-16 ***
## month_Dec 2.243e+03 1.782e+03 1.259 0.209
## year_2021 -1.210e+04 1.599e+03 -7.567 7.19e-13 ***
## week_51 -1.625e+04 2.612e+03 -6.219 2.07e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4707 on 252 degrees of freedom
## Multiple R-squared: 0.901, Adjusted R-squared: 0.8979
## F-statistic: 286.7 on 8 and 252 DF, p-value: < 2.2e-16
model %>% decomp_chart()
The model is getting better and better, with an adjusted R-squared almost reaching 90%. This doesn't mean it can't be improved further! Google Trends can be a very useful source of data, as Google search volumes are often correlated with events and can be used as a proxy for a missing variable. The linea::gt_f() function returns the original data.frame with the Google Trends variable for the given keyword appended.
data = data %>%
  gt_f(kw = 'ramadan', append = T) %>%
  gt_f(kw = 'trump', append = T) %>%
  gt_f(kw = 'prime day', append = T) %>%
  gt_f(kw = 'amazon workers', append = T)
data %>%
  datatable(options = list(scrollX = T), rownames = NULL)
plot_ly(data) %>%
  add_lines(y = ~ gtrends_ramadan,
            x = ~ date,
            name = 'gtrends_ramadan',
            color = color_palette()[1]) %>%
  add_lines(y = ~ gtrends_trump,
            x = ~ date,
            name = 'gtrends_trump',
            color = color_palette()[2]) %>%
  add_lines(y = ~ `gtrends_prime day`,
            x = ~ date,
            name = 'gtrends_prime day',
            color = color_palette()[3]) %>%
  layout(yaxis = list(title = 'value'),
         title = 'Google Trend Variables',
         plot_bgcolor = "rgba(0, 0, 0, 0)",
         paper_bgcolor = "rgba(0, 0, 0, 0)")
Now that these variables are part of our data, we can use the linea::what_next() function to see if they can be added to the model.
df = model %>% what_next(data = data)
## Warning: model object does not contain 'meta_data'.
df %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))
As shown in the table above, the new variable, gtrends_prime day, seems like a sensible addition to the model.
ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec','year_2021','week_51','gtrends_prime day')
model = run_model(data = data,
                  categories = categories,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')
summary(model)
##
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -14716.6 -2740.3 -684.6 2764.0 14925.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.735e+04 6.104e+02 77.569 < 2e-16 ***
## christmas 3.173e+02 2.696e+01 11.766 < 2e-16 ***
## covid 1.934e+02 1.440e+01 13.429 < 2e-16 ***
## black.friday 2.534e+02 2.541e+01 9.973 < 2e-16 ***
## offline_media 4.688e+00 4.063e-01 11.537 < 2e-16 ***
## trend 8.181e+01 5.017e+00 16.306 < 2e-16 ***
## month_Dec 2.481e+03 1.630e+03 1.522 0.129
## year_2021 -1.152e+04 1.465e+03 -7.865 1.10e-13 ***
## week_51 -1.653e+04 2.389e+03 -6.917 3.82e-11 ***
## gtrends_prime day 1.771e+02 2.498e+01 7.091 1.35e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4305 on 251 degrees of freedom
## Multiple R-squared: 0.9175, Adjusted R-squared: 0.9146
## F-statistic: 310.3 on 9 and 251 DF, p-value: < 2.2e-16
Using the variable decomposition, we can see the new variable nicely captures that July peak.
model %>% decomp_chart(variable_decomp = T)
The model now has an R-squared greater than 90% and can be presented in a more polished way using categories and other charting functions.
ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec','year_2021','week_51','gtrends_prime day')
categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','retail events','media','Base','seasonality','covid','seasonality','retail events'),
  calc = c('min','none','min','none','none','none','none','none','none')
)
model = run_model(data = data,
                  categories = categories,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')
model %>%
  decomp_chart()
model %>% fit_chart()
Another available feature relates to panel data and pooled models. linea's pooling functionality divides the dependent variable by the mean of each group (pool, panel, region, etc.). When the coefficients are then multiplied by that same mean, we get a scaled coefficient for each group.
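As a rough illustration of that normalisation (not linea's internal code), consider a toy data.frame with a made-up group column:

# toy sketch: each pool's dependent variable is divided by that pool's mean
toy = tibble(
  country = c('UK','UK','India','India'),
  sales   = c(100, 120, 40, 60)
)
toy %>%
  group_by(country) %>%
  mutate(sales_norm = sales / mean(sales)) %>%
  ungroup()
# a coefficient fitted on sales_norm is then re-scaled back per pool
# by multiplying it by that pool's mean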
Let's start by looking at some pooled data. As we can see, the data below, generated again through Google Trends, has a non-numeric variable: country.
data_path = 'https://raw.githubusercontent.com/paladinic/data/main/pooled%20data.csv'
data = read_xcsv(file = data_path)
data %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))
dv = 'amazon'
ivs = c('christmas','rakhi','diwali')
id_var = 'Week'
model = run_model(data = data,
                  dv = dv,
                  ivs = ivs,
                  id_var = id_var,
                  normalise_by_pool = T)
model %>%
  decomp_chart()
In the chart above, the model's decomposition is simply aggregated across pools, while still accounting for the re-scaled coefficients. Visualization functions, such as linea::decomp_chart(), allow the visualization to be filtered by pool, as shown below.
model %>%
  decomp_chart(pool = 'UK')
model %>%
  fit_chart(pool = 'India')
model %>%
  response_curves()
The Getting Started page is a good place to start learning how to build linear models with linea.
The Advanced Features page shows how to implement the features of linea that allow users to capture non-linear relationships.
The Additional Features page covers all other functions of the library.